Skip to content

Conversation

@theihor
Copy link
Contributor

@theihor theihor commented Sep 5, 2025

A number of fixes were implemented internally at Meta to the KPD instance supporting BPF CI.

This PR ports the fixes, see commits for details.

The fix in series to git branch mapping 88cb274 made _guess_pr() to
work incorrectly due to a wrong assumption there.  _guess_pr() looks
up a known open PR *by subject name*, which may not necessarily be the
right PR for a given series.

Before 88cb274 the branch name was (incorrectly) based on the oldest
series id.  However currently open PRs still use the wrong branch
name.  _guess_pr() then returns a wrong cached PR/branch.  This leads
to an exception in apply_push_comment(), because it tries to verify
that PR was successfully updated, but looks at a wrong PR.

Fix this by not using lookup in local cache by name in guess_pr().

Differential Revision: D79654888

Reviewed-by: Jordan Rome <[email protected]>
Signed-off-by: Ihor Solodrai <[email protected]>
Fix a dormant bug in guess_pr(): wrong base branch key has been used
for lookup of synced PRs.

Apparently this code path almost never executed until e28008d, which
removed another level of "cache" with lookup by name.

Differential Revision: D79693152

Reviewed-by: Jordan Rome <[email protected]>
Signed-off-by: Ihor Solodrai <[email protected]>
GithubSync.close_existing_prs_for_series() is supposed to close all
*other* PRs associated with a series being processed.  So far it only
looked at the *branch* name (`series/<id>=><base>`), when deciding
whether two PRs match.  This worked due to a bug, fixed in 88cb274,
which caused the oldest series id (branch name) to always be
associated with the series name (PR title).  Now that the bug is
fixed, we have to match by name during clean up, which is what happens
in this change.

Differential Revision: D79763361

Reviewed-by: Amery Hung <[email protected]>
Signed-off-by: Ihor Solodrai <[email protected]>
After recent fixes in subject-branch mapping, when syncing "old"
subjects a situation may occur when there are no relevant series for
an open PR. This caused the following error message:

```
E0807 21:21:26.207 branch_worker.py:863] BUG: Unable to find PR for selftests/bpf: Install test modules into $INSTALL_PATH https://patchwork.kernel.org/project/netdevbpf/list/?series=985731
```

Fix it by explicitly checking for latest_series() and closing the PR
if it doesn't exist.

Differential Revision: D79864591

Reviewed-by: Amery Hung <[email protected]>
Signed-off-by: Ihor Solodrai <[email protected]>
There is an intermittent email spam issue in rc: KPD tries to update
PR labels, say from ci-fail to ci-pass, and immediately updates them
back, causing an email notification on every iteration.

The logic in the code is reasonable: we remove a "not_label" first, if
it's in current labels, and add "new_label" and send a notification if
it's not in current labels.

The only odd thing that seems to be correlated with this behavior is
forward proxy errors, which suggests failures when changing the pr
(which makes calls github API).

I noticed that KPD is looking at 2-3s old PR object in
branch_worker.evaluate_ci_result().  While generally this shouldn't be
an issue, let's add an explicit pr.update() call to make sure up to
date labels are tested.

The hypothesis is that KPD misbehavior might be caused by failed (or
implicitly retried by proxy??) pr.remove_from_labels() /
pr.add_to_labels() calls.  If this is so, then adding pr.update()
would shift the failure/retry to a less disruptive GET operation,
which will "fix" KPD even if the proxy errors are still there.

Differential Revision: D79931497

Reviewed-by: Jordan Rome <[email protected]>
Signed-off-by: Ihor Solodrai <[email protected]>
The HTTP connections to GitHub are managed by pygithub library. It
uses a connection pool, initialized when Github object is created (in
KPD we have one per BranchWorker), and the connections are actively
reused.

KPD creates the BranchWorkers, and by extension the Github objects,
only once at start up. However KPD runs the main sync_patches()
routine once every two minutes, which means that actual underlying
connections are very likely to become stale.

Of course, there is an encapsulated logic in pygithub to re-establish
the connections. However it appears that attempting to use a
connection that was never explicitly closed, causes issues when
interacting with ttlsproxy. In particular we often get
AsyncSocketException, which appears to be correlated with KPD's
misbehavior with respect to email notifications.

Attempt to mitigate these problems by re-creating GithubSync object
(which spawns BranchWorker-s) on every iteration of the main daemon
loop (that is: every 2 mins).

Differential Revision: D80043971
Reviewed-by: Song Liu <[email protected]>
Signed-off-by: Ihor Solodrai <[email protected]>
@theihor theihor merged commit f27336c into main Sep 8, 2025
5 checks passed
@theihor theihor deleted the ported-fixes branch September 8, 2025 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants